Energy Bounds for Fault-Tolerant Nanoscale Designs
The problem of determining lower bounds for the energy cost of a given
nanoscale design is addressed via a complexity theory-based approach. This
paper provides a theoretical framework that is able to assess the trade-offs
existing in nanoscale designs between the amount of redundancy needed for a
given level of resilience to errors and the associated energy cost. Circuit
size, logic depth and error resilience are analyzed and brought together in a
theoretical framework that can be seamlessly integrated with automated
synthesis tools and can guide the design process of nanoscale systems composed
of failure-prone devices. The impact of redundancy addition on the switching
energy and its relationship with leakage energy is modeled in detail. Results
show that 99% error resilience is possible for fault-tolerant designs, but at
the expense of at least 40% more energy if individual gates fail independently
with a probability of 1%. Comment: Submitted on behalf of EDAA (http://www.edaa.com/)
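As a rough, hypothetical illustration of the redundancy-versus-resilience trade-off discussed above (not the paper's complexity-theoretic framework), the sketch below computes the resilience of simple majority voting over independent replicas at the 1% gate failure probability mentioned in the abstract; the replica count stands in for a naive energy overhead.

```python
from math import comb

def majority_vote_failure(p: float, n: int) -> float:
    """Probability that a strict majority of n independent replicas is wrong,
    given each replica fails independently with probability p."""
    assert n % 2 == 1, "use an odd replica count so a strict majority exists"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.01  # gate failure probability from the abstract
for n in (1, 3, 5):
    resilience = 1 - majority_vote_failure(p, n)
    # The paper models switching and leakage energy in detail; here the
    # overhead is naively proportional to the replica count.
    print(f"replicas={n}  resilience={resilience:.4%}  naive energy overhead={n}x")
```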
Hardware-Aware Machine Learning: Modeling and Optimization
Recent breakthroughs in Deep Learning (DL) applications have made DL models a
key component in almost every modern computing system. The increased popularity
of DL applications deployed on a wide spectrum of platforms has resulted in a
plethora of design challenges related to the constraints introduced by the
hardware itself. What is the latency or energy cost for an inference made by a
Deep Neural Network (DNN)? Is it possible to predict this latency or energy
consumption before a model is trained? If yes, how can machine learners take
advantage of these models to design the hardware-optimal DNN for deployment?
From lengthening battery life of mobile devices to reducing the runtime
requirements of DL models executing in the cloud, the answers to these
questions have drawn significant attention.
One cannot optimize what isn't properly modeled. Therefore, it is important
to understand the hardware efficiency of DL models during serving for making an
inference, before even training the model. This key observation has motivated
the use of predictive models to capture the hardware performance or energy
efficiency of DL applications. Furthermore, DL practitioners are challenged
with the task of designing the DNN model, i.e., of tuning the hyper-parameters
of the DNN architecture, while optimizing for both accuracy of the DL model and
its hardware efficiency. Therefore, state-of-the-art methodologies have
proposed hardware-aware hyper-parameter optimization techniques. In this paper,
we provide a comprehensive assessment of state-of-the-art work and selected
results on the hardware-aware modeling and optimization for DL applications. We
also highlight several open questions that are poised to give rise to novel
hardware-aware designs in the next few years, as DL applications continue to
significantly impact associated hardware systems and platforms. Comment: ICCAD'18 Invited Paper
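As a minimal sketch of the kind of hardware-cost predictor surveyed above, the snippet below fits a linear model that maps layer hyper-parameters to measured latency, so the cost of an unseen layer can be estimated before training. The feature layout and the profiling numbers are hypothetical placeholders, not data from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: (input_size, kernel_size, in_channels, out_channels, stride)
layer_configs = np.array([
    [224, 3, 3, 64, 2],
    [112, 3, 64, 128, 1],
    [56, 3, 128, 256, 2],
])
# Latencies would come from profiling on the target hardware (values invented).
measured_latency_ms = np.array([1.9, 3.4, 2.7])

model = LinearRegression().fit(layer_configs, measured_latency_ms)

# Predict the latency of an unseen layer before ever training the DNN that contains it.
candidate = np.array([[28, 3, 256, 512, 1]])
print(model.predict(candidate))
```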
Layer-compensated Pruning for Resource-constrained Convolutional Neural Networks
Resource-efficient convolutional neural networks enable not only
intelligence on edge devices but also opportunities in system-level
optimization such as scheduling. In this work, we aim to improve the
performance of resource-constrained filter pruning by merging two sub-problems
commonly considered, i.e., (i) how many filters to prune for each layer and
(ii) which filters to prune given a per-layer pruning budget, into a global
filter ranking problem. Our framework entails a novel algorithm, dubbed
layer-compensated pruning, where meta-learning is involved to determine better
solutions. We show empirically that the proposed algorithm is superior to prior
art in both effectiveness and efficiency. Specifically, we reduce the accuracy
gap between the pruned and original networks from 0.9% to 0.7% with an 8x
reduction in time needed for meta-learning, i.e., from 1 hour down to 7
minutes. We further demonstrate the effectiveness of our algorithm using
ResNet and MobileNetV2 networks on the CIFAR-10, ImageNet, and Bird-200
datasets. Comment: 11 pages, 8 figures, work in progress
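A minimal sketch of the global filter-ranking idea described above, assuming an L1-norm saliency per filter plus an additive per-layer compensation term (the quantity the paper determines with meta-learning); the function and its arguments are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

def global_filter_ranking(model: nn.Module, compensation: dict, sparsity: float):
    """Rank all filters of all conv layers on one global scale and mark the
    bottom `sparsity` fraction for pruning. `compensation` maps a layer name
    to a per-layer offset (assumed here to be learned elsewhere)."""
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # L1 norm of each output filter as a simple saliency proxy.
            saliency = module.weight.detach().abs().sum(dim=(1, 2, 3))
            for idx, s in enumerate(saliency):
                scores.append((s.item() + compensation.get(name, 0.0), name, idx))
    scores.sort(key=lambda t: t[0])
    n_prune = int(len(scores) * sparsity)
    return scores[:n_prune]  # (score, layer name, filter index) of filters to prune
```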
Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines
One of the most important problems faced by microarchitecture designers is the poor scalability of some of the current solutions with increased clock frequencies and wider pipelines. As several studies show, internal processor structures scale differently with decreasing device sizes. While in some cases the access latency is determined by the speed of the logic circuitry, for others it is dominated by the interconnect delay. Furthermore, while some stages can be super-pipelined with relatively small performance loss, others must be kept atomic. This paper proposes a possible solution to this problem, avoiding the traditional trade-off between parallelism and clock speed. First, allowing instructions to enter and leave the Issue Window in an asynchronous manner enables faster speeds in the front-end at the expense of small synchronization latencies. Second, using an Execution Cache for storing instructions that are already scheduled allows for bypassing the issue circuitry and thus clocking the execution core at higher frequencies. Combined, these two mechanisms result in a 50% to 60% performance increase for our test microarchitecture, without requiring a completely new scheduling mechanism. Furthermore, the proposed microarchitecture requires significantly less energy, with a 30% reduction in a 0.13um process technology or 20% in a 0.06um process technology over the original baseline.
Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications
Dynamically adaptive multi-core architectures have been proposed as an
effective solution to optimize performance for peak power constrained
processors. In such processors, the micro-architectural parameters or
voltage/frequency of each core can be changed at run-time, thus providing a
range of power/performance operating points for each core. In this paper, we
propose Thread Progress Equalization (TPEq), a run-time mechanism for power
constrained performance maximization of multithreaded applications running on
dynamically adaptive multicore processors. Compared to existing approaches,
TPEq (i) identifies and addresses two primary sources of inter-thread
heterogeneity in multithreaded applications, (ii) determines the optimal core
configurations in polynomial time with respect to the number of cores and
configurations, and (iii) requires no modifications in the user-level source
code. Our experimental evaluations demonstrate that TPEq outperforms
state-of-the-art run-time power/performance optimization techniques proposed in
the literature for dynamically adaptive multicores by up to 23%.
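A hypothetical, greatly simplified sketch in the spirit of TPEq: under a chip-level power budget, repeatedly upgrade the core running the thread predicted to finish last. The configuration table, the greedy loop, and all numbers are illustrative assumptions; the actual mechanism models inter-thread heterogeneity and finds the optimal assignment in polynomial time.

```python
CONFIGS = [          # (relative speed, power in watts) per core configuration
    (1.0, 5.0),
    (1.4, 8.0),
    (1.8, 12.0),
]

def equalize(remaining_work, power_budget):
    """remaining_work[i]: predicted work left for thread i (arbitrary units).
    Greedily upgrade the core whose thread is predicted to finish last."""
    assignment = [0] * len(remaining_work)
    power = sum(CONFIGS[c][1] for c in assignment)
    while True:
        finish = [remaining_work[i] / CONFIGS[assignment[i]][0]
                  for i in range(len(remaining_work))]
        slowest = max(range(len(finish)), key=lambda i: finish[i])
        cfg = assignment[slowest]
        if cfg + 1 == len(CONFIGS):
            break                      # the critical thread is already maxed out
        extra = CONFIGS[cfg + 1][1] - CONFIGS[cfg][1]
        if power + extra > power_budget:
            break                      # no headroom left in the power budget
        assignment[slowest] = cfg + 1
        power += extra
    return assignment

print(equalize([0.30, 0.55, 0.42, 0.60], power_budget=30.0))
```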
Learning-based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems
The rising use of deep learning and other big-data algorithms has led to an
increasing demand for hardware platforms that are computationally powerful, yet
energy-efficient. Due to the amount of data parallelism in these algorithms,
high-performance 3D manycore platforms that incorporate both CPUs and GPUs
present a promising direction. However, as systems use heterogeneity (e.g., a
combination of CPUs, GPUs, and accelerators) to improve performance and
efficiency, it becomes more pertinent to address the distinct and likely
conflicting communication requirements (e.g., CPU memory access latency or GPU
network throughput) that arise from such heterogeneity. Unfortunately, it is
difficult to quickly explore the hardware design space and choose appropriate
tradeoffs between these heterogeneous requirements. To address these
challenges, we propose the design of a 3D Network-on-Chip (NoC) for
heterogeneous manycore platforms that considers the appropriate design
objectives for a 3D heterogeneous system and explores various tradeoffs using
an efficient ML-based multi-objective optimization technique. The proposed
design space exploration considers the various requirements of its
heterogeneous components and generates a set of 3D NoC architectures that
efficiently trade off these design objectives. Our findings show that by
jointly considering these requirements (latency, throughput, temperature, and
energy), we can achieve 9.6% better Energy-Delay Product on average at nearly
iso-temperature conditions when compared to a thermally-optimized design for 3D
heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs
optimized for a few applications can be generalized for unknown applications as
well. Our results show that these generalized 3D NoCs only incur a 1.8%
(36-tile system) and 1.1% (64-tile system) average performance loss compared to
application-specific NoCs. Comment: Published in IEEE Transactions on Computers
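The snippet below sketches only the multi-objective selection step implied above: keeping candidate NoC designs that are not dominated on latency, energy, peak temperature, and (negated) throughput. It is not the paper's ML-based optimizer, and the candidate tuples are made up.

```python
def dominates(a, b):
    """True if design a is at least as good as b on every objective and
    strictly better on at least one (all objectives are lower-is-better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

candidates = [
    # (latency, energy, peak temperature, -throughput) -- hypothetical values
    (12.0, 3.1, 78.0, -450.0),
    (10.5, 3.6, 82.0, -470.0),
    (12.5, 3.2, 79.0, -440.0),  # dominated by the first design
]
print(pareto_front(candidates))
```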
Task Scheduling for Heterogeneous Multicore Systems
In recent years, as the demand for low energy and high performance computing
has steadily increased, heterogeneous computing has emerged as an important and
promising solution. Because most workloads typically run most efficiently
on certain types of cores, mapping tasks onto the best available resources can
not only save energy but also deliver high performance. However, optimal task
scheduling for performance and/or energy is yet to be solved for heterogeneous
platforms. The work presented herein mathematically formulates the optimal
heterogeneous system task scheduling as an optimization problem using queueing
theory. We analytically solve for the common case of two processor types, e.g.,
CPU+GPU, and give an optimal policy (CAB). We design the GrIn heuristic to
efficiently solve for near-optimal policy for any number of processor types
(within 1.6% of the optimal). Both policies work for any task size distribution
and processing order, and are therefore general and practical. We extensively
simulate and validate the theory, and implement the proposed policy in a
CPU-GPU real platform to show the optimal throughput and energy improvement.
Compared to classic policies such as load balancing, our results range from
1.08x~2.24x better performance or 1.08x~2.26x better energy efficiency in
simulations, and 2.37x~9.07x better performance in experiments. Comment: heterogeneous
systems; scheduling; performance modeling; queueing theory
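As a loose, hypothetical illustration of queue-aware dispatch between two processor types, the sketch below sends each arriving task to the device with the smaller estimated completion time (queue length over service rate). It is a simple heuristic in the spirit of the abstract, not the paper's CAB or GrIn policies, and the service rates are invented.

```python
from collections import deque

SERVICE_RATE = {"cpu": 2.0, "gpu": 5.0}   # tasks per second, hypothetical
queues = {"cpu": deque(), "gpu": deque()}

def dispatch(task):
    # Estimated completion time if this task joins the given device's queue.
    def est_completion(dev):
        return (len(queues[dev]) + 1) / SERVICE_RATE[dev]
    target = min(queues, key=est_completion)
    queues[target].append(task)
    return target

for t in range(10):
    print(t, "->", dispatch(f"task-{t}"))
```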
NeuralPower: Predict and Deploy Energy-Efficient Convolutional Neural Networks
"How much energy is consumed for an inference made by a convolutional neural
network (CNN)?" With the increased popularity of CNNs deployed on the
wide-spectrum of platforms (from mobile devices to workstations), the answer to
this question has drawn significant attention. From lengthening battery life of
mobile devices to reducing the energy bill of a datacenter, it is important to
understand the energy efficiency of CNNs during serving for making an
inference, before actually training the model. In this work, we propose
NeuralPower: a layer-wise predictive framework based on sparse polynomial
regression, for predicting the serving energy consumption of a CNN deployed on
any GPU platform. Given the architecture of a CNN, NeuralPower provides an
accurate prediction and breakdown for power and runtime across all layers in
the whole network, helping machine learners quickly identify the power,
runtime, or energy bottlenecks. We also propose the "energy-precision ratio"
(EPR) metric to guide machine learners in selecting an energy-efficient CNN
architecture that better trades off the energy consumption and prediction
accuracy. The experimental results show that the prediction accuracy of the
proposed NeuralPower outperforms the best published model to date, yielding an
improvement in accuracy of up to 68.5%. We also assess the accuracy of
predictions at the network level, by predicting the runtime, power, and energy
of state-of-the-art CNN architectures, achieving an average accuracy of 88.24%
in runtime, 88.34% in power, and 97.21% in energy. We comprehensively
corroborate the effectiveness of NeuralPower as a powerful framework for
machine learners by testing it on different GPU platforms and Deep Learning
software tools. Comment: Accepted as a conference paper at ACML 2017
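A minimal sketch of the layer-wise sparse polynomial regression idea: expand layer hyper-parameters into polynomial terms and fit a Lasso model so that only a sparse subset of terms survives. The feature layout, profiling values, and hyper-parameters are assumptions, not NeuralPower's actual models.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Per conv layer: (batch, input_size, kernel_size, in_channels, out_channels)
X = np.array([
    [32, 224, 3, 3, 64],
    [32, 112, 3, 64, 128],
    [32, 56, 3, 128, 256],
    [32, 28, 3, 256, 512],
])
power_watts = np.array([41.0, 55.0, 63.0, 60.0])   # hypothetical GPU measurements

# Degree-2 polynomial expansion + L1 penalty keeps only a sparse set of terms.
model = make_pipeline(PolynomialFeatures(degree=2), Lasso(alpha=0.1, max_iter=50_000))
model.fit(X, power_watts)

# Network-level energy would be the sum over layers of predicted power times
# predicted runtime; only the power predictor is sketched here.
print(model.predict(np.array([[32, 14, 3, 512, 512]])))
```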
Towards Efficient Model Compression via Learned Global Ranking
Pruning convolutional filters has demonstrated its effectiveness in
compressing ConvNets. Prior art in filter pruning requires users to specify a
target model complexity (e.g., model size or FLOP count) for the resulting
architecture. However, determining a target model complexity can be difficult
for optimizing various embodied AI applications such as autonomous robots,
drones, and user-facing applications. First, both the accuracy and the speed of
ConvNets can affect the performance of the application. Second, the performance
of the application can be hard to assess without evaluating ConvNets during
inference. As a consequence, finding a sweet-spot between the accuracy and
speed via filter pruning, which needs to be done in a trial-and-error fashion,
can be time-consuming. This work takes a first step toward making this process
more efficient by altering the goal of model compression to producing a set of
ConvNets with various accuracy and latency trade-offs instead of producing one
ConvNet targeting some pre-defined latency constraint. To this end, we propose
to learn a global ranking of the filters across different layers of the
ConvNet, which is used to obtain a set of ConvNet architectures that have
different accuracy/latency trade-offs by pruning the bottom-ranked filters. Our
proposed algorithm, LeGR, is shown to be 2x to 3x faster than prior work while
having comparable or better performance when targeting seven pruned ResNet-56
architectures with different accuracy/FLOPs profiles on the CIFAR-100 dataset. Additionally,
we have evaluated LeGR on ImageNet and Bird-200 with ResNet-50 and MobileNetV2
to demonstrate its effectiveness. Code available at
https://github.com/cmu-enyac/LeGR. Comment: CVPR 2020 Oral
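A minimal sketch of the learned global ranking, assuming the per-layer affine transforms (a, b) are already given; LeGR's contribution is learning them, which is not shown here, and the function below is hypothetical rather than the released implementation.

```python
import torch
import torch.nn as nn

def global_rank(model: nn.Module, affine: dict):
    """Score every filter on one global scale using a per-layer affine
    transform of its norm: score = a * ||filter||_2 + b. The (a, b) pairs
    are assumed to be supplied (learned elsewhere)."""
    scored = []
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            a, b = affine.get(name, (1.0, 0.0))
            norms = m.weight.detach().flatten(1).norm(dim=1)   # per-filter L2 norm
            scored += [(a * n.item() + b, name, i) for i, n in enumerate(norms)]
    return sorted(scored)

# One ranking yields a whole family of pruned ConvNets: pruning the bottom 20%,
# 40%, ... of the same sorted list gives different accuracy/latency trade-offs.
```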
Regularizing Activation Distribution for Training Binarized Deep Networks
Binarized Neural Networks (BNNs) can significantly reduce the inference
latency and energy consumption in resource-constrained devices due to their
pure-logical computation and fewer memory accesses. However, training BNNs is
difficult since the activation flow encounters degeneration, saturation, and
gradient mismatch problems. Prior work alleviates these issues by increasing
activation bits and adding floating-point scaling factors, thereby sacrificing
BNN's energy efficiency. In this paper, we propose to use distribution loss to
explicitly regularize the activation flow, and develop a framework to
systematically formulate the loss. Our experiments show that the distribution
loss can consistently improve the accuracy of BNNs without losing their energy
benefits. Moreover, equipped with the proposed regularization, BNN training is
shown to be robust to the selection of hyper-parameters including optimizer and
learning rate.
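As a hedged sketch of regularizing the pre-binarization activation distribution, the penalty below discourages a channel's activations from collapsing to a single sign (degeneration) or drifting far past the binarization threshold (saturation). The paper derives its distribution loss systematically; this particular form and its weight are assumptions.

```python
import torch

def distribution_loss(pre_activation: torch.Tensor, saturation_limit: float = 3.0):
    """pre_activation: real-valued inputs to sign(), shape (batch, channels, ...)."""
    flat = pre_activation.flatten(2) if pre_activation.dim() > 2 else pre_activation.unsqueeze(2)
    mean = flat.mean(dim=(0, 2))                       # per-channel mean
    degeneration = mean.pow(2).mean()                  # push each channel's mean toward 0
    saturation = torch.relu(flat.abs() - saturation_limit).pow(2).mean()
    return degeneration + saturation

# Added to the task loss with a small weight, e.g.:
#   loss = criterion(logits, targets) + 0.1 * sum(distribution_loss(a) for a in pre_acts)
```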